Abstract

We  develop  some  new  error  bounds  for  learning  algorithms  induced  by  regularization 
methods  in  the  regression  setting.  The  “hardness”  of  the  problem  is  characterized  in  terms 
of  the  parameters  r  and  s,  the  first  related  to  the  “complexity”  of  the  target  function, 
the  second  connected  to  the  effective  dimension  of  the  marginal  probability  measure  over 
the input space. We show, extending previous results, that by a suitable choice of the
regularization parameter as a function of the number of the available examples, it is
possible to attain the optimal minimax rates of convergence for the expected squared loss of
the estimators, over the family of priors fulfilling the constraint $r + s \ge \frac12$. The setting
considers both labelled and unlabelled examples, the latter being crucial for the optimality
results on the priors in the range $r < \frac12$.
1. Introduction


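We consider the setting of semi-supervised statistical learning. We assume $Y \subset [-M, M]$ and the supervised part of the training set equal to

$$z = (z_1, \dots, z_m),$$

with $z_i = (x_i, y_i)$ drawn i.i.d. according to the probability measure $\rho$ over $Z = X \times Y$. Moreover, consider the unsupervised part of the training set $(x^u_{m+1}, \dots, x^u_{\tilde m})$, with $x^u_i$ drawn i.i.d. according to the marginal probability measure over $X$, $\rho_X$. For the sake of brevity we will also introduce the complete training set

$$\tilde z = (\tilde z_1, \dots, \tilde z_{\tilde m}),$$

with $\tilde z_i = (\tilde x_i, \tilde y_i)$, where we introduced the compact notations $\tilde x_i$ and $\tilde y_i$, defined by

$$\tilde x_i = \begin{cases} x_i & \text{if } 1 \le i \le m, \\ x^u_i & \text{if } m < i \le \tilde m, \end{cases}$$

and

$$\tilde y_i = \begin{cases} \frac{\tilde m}{m}\, y_i & \text{if } 1 \le i \le m, \\ 0 & \text{if } m < i \le \tilde m. \end{cases}$$

It is clear that, in the supervised setting, the unsupervised part of the training set is missing, whence $\tilde m = m$ and $\tilde z = z$.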
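As an illustration of the notation just introduced, the following minimal sketch assembles the complete training set from a labelled and an unlabelled sample. It assumes the rescaling $\tilde y_i = (\tilde m / m)\, y_i$ on the labelled points and $\tilde y_i = 0$ on the unlabelled ones; the function name `complete_training_set` and the toy data are hypothetical and not part of the paper.

```python
import numpy as np

def complete_training_set(x_lab, y_lab, x_unlab):
    """Stack labelled and unlabelled inputs and build the rescaled outputs.

    Returns (x_tilde, y_tilde), where y_tilde_i = (m_tilde / m) * y_i on the
    labelled points and y_tilde_i = 0 on the unlabelled ones (assumed scaling).
    """
    x_lab = np.asarray(x_lab, dtype=float)
    y_lab = np.asarray(y_lab, dtype=float)
    # Give the unlabelled inputs the same trailing shape as the labelled ones.
    x_unlab = np.asarray(x_unlab, dtype=float).reshape(-1, *x_lab.shape[1:])
    m = len(x_lab)
    m_tilde = m + len(x_unlab)
    x_tilde = np.concatenate([x_lab, x_unlab], axis=0)
    y_tilde = np.concatenate([(m_tilde / m) * y_lab, np.zeros(len(x_unlab))])
    return x_tilde, y_tilde

# Toy example: m = 2 labelled points and 2 unlabelled points (m_tilde = 4).
x_tilde, y_tilde = complete_training_set(
    [[0.0], [1.0]], [0.5, -0.5], [[2.0], [3.0]])
```

With this scaling, the average of the $\tilde y_i$ over the complete set coincides with the average of the $y_i$ over the labelled set, which is the mechanism that lets the unlabelled points enter the estimator only through the inputs.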

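In the following we will study the generalization properties of a class of estimators $f_{\tilde z,\lambda}$ belonging to the hypothesis space $\mathcal{H}$, the RKHS of functions on $X$ induced by the bounded Mercer kernel $K$ (in the following $\kappa = \sup_{x \in X} K(x,x)$). The learning algorithms that we consider have the general form

(1)  $f_{\tilde z,\lambda} = G_\lambda(T_{\tilde x})\, g_{\tilde z},$

where $T_{\tilde x} \in \mathcal{L}(\mathcal{H})$ is given by

$$T_{\tilde x} f = \frac{1}{\tilde m} \sum_{i=1}^{\tilde m} K_{\tilde x_i} \langle K_{\tilde x_i}, f \rangle_{\mathcal{H}},$$

$g_{\tilde z} \in \mathcal{H}$ is given by

$$g_{\tilde z} = \frac{1}{\tilde m} \sum_{i=1}^{\tilde m} K_{\tilde x_i} \tilde y_i = \frac{1}{m} \sum_{i=1}^{m} K_{x_i} y_i,$$

and the regularization parameter $\lambda$ lies in the range $(0, \kappa]$. We will often use the shortcut notation $\dot\lambda = \lambda / \kappa$.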
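The general form (1) can be made concrete in coefficient form. The sketch below relies on a standard identity that is not spelled out in the text: writing $f = \sum_i \alpha_i K_{\tilde x_i}$ and denoting by $\mathbf{K}$ the $\tilde m \times \tilde m$ kernel matrix, one has $\alpha = \tilde m^{-1} G_\lambda(\mathbf{K}/\tilde m)\, \tilde y$. The Gaussian kernel, the function names and the synthetic data are assumptions chosen for illustration only.

```python
import numpy as np

def gaussian_kernel(a, b, width=1.0):
    """Gaussian kernel matrix K(a_i, b_j); for this kernel sup K(x, x) = 1."""
    sq = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq / (2.0 * width ** 2))

def filter_coefficients(K, y_tilde, lam, G):
    """Coefficients alpha such that f = sum_i alpha_i K(x_tilde_i, .).

    Implements alpha = G_lam(K / m_tilde) y_tilde / m_tilde through an
    eigendecomposition of the normalized kernel matrix.
    """
    m_tilde = K.shape[0]
    evals, evecs = np.linalg.eigh(K / m_tilde)
    evals = np.clip(evals, 0.0, None)       # remove tiny negative round-off
    filt = G(evals, lam)                    # apply G_lam to the spectrum
    return evecs @ (filt * (evecs.T @ y_tilde)) / m_tilde

def tikhonov(sigma, lam):
    """G_lam(sigma) = 1 / (sigma + lam), the RLS choice."""
    return 1.0 / (sigma + lam)

rng = np.random.default_rng(0)
m, m_tilde = 10, 30                         # the first 10 points are labelled
x_tilde = rng.uniform(-1.0, 1.0, size=(m_tilde, 1))
y_tilde = np.zeros(m_tilde)
y_tilde[:m] = (m_tilde / m) * np.sin(3.0 * x_tilde[:m, 0])

lam = 0.1
K = gaussian_kernel(x_tilde, x_tilde)
alpha = filter_coefficients(K, y_tilde, lam, tikhonov)
# For the Tikhonov filter the same coefficients solve (K + m_tilde*lam*I) alpha = y_tilde.
alpha_direct = np.linalg.solve(K + m_tilde * lam * np.eye(m_tilde), y_tilde)
```

Passing a different function `G` (any filter defined on the spectrum) changes the regularization method without touching the rest of the computation, which mirrors the role of $G_\lambda$ in (1).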

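The functions $G_\lambda : [0, \kappa] \to \mathbb{R}$, which select the regularization method, will be characterized in terms of the constants $A$ and $B_r$ in $[0, +\infty]$, defined as follows

(2)  $A = \sup_{\lambda \in (0,\kappa]} \ \sup_{\sigma \in [0,\kappa]} \left| (\sigma + \lambda)\, G_\lambda(\sigma) \right|,$

(3)  $B_r = \sup_{t \in [0,r]} \ \sup_{\lambda \in (0,\kappa]} \ \sup_{\sigma \in [0,\kappa]} \left| 1 - G_\lambda(\sigma)\sigma \right| \sigma^t \lambda^{-t}, \qquad r > 0.$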

Finiteness of $A$ and $B_r$ (with $r$ over a suitable range) is a standard assumption in the literature
on ill-posed inverse problems (see for reference [12]). Regularization methods have been
recently studied in the context of learning theory in [13, 9, 8, 10, 1].

The main results of the paper, Theorems 1 and 2, describe the convergence rates of $f_{\tilde z,\lambda}$ to the target function $f_{\mathcal{H}}$. Here, the target function is the "best" function which can be arbitrarily well approximated by elements of our hypothesis space $\mathcal{H}$. More formally, $f_{\mathcal{H}}$ is the projection of the regression function $f_\rho(x) = \int_Y y \, d\rho_{|x}(y)$ onto the closure of $\mathcal{H}$ in $L^2(X, \rho_X)$.

The convergence rates in Theorems 1 and 2 will be described in terms of the constants $C_r$ and $D_s$ in $[0, +\infty]$ characterizing the probability measure $\rho$. These constants can be described in terms of the integral operator $L_K : L^2(X, \rho_X) \to L^2(X, \rho_X)$ of kernel $K$. Note that the same integral operator is denoted by $T$ when seen as a bounded operator from $\mathcal{H}$ to $\mathcal{H}$.

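The constants $C_r$ characterize the conditional distributions $\rho_{|x}$ through $f_{\mathcal{H}}$; they are defined as follows

(4)  $C_r = \begin{cases} \kappa^r \left\| L_K^{-r} f_{\mathcal{H}} \right\|_\rho & \text{if } f_{\mathcal{H}} \in \operatorname{Im} L_K^r, \\ +\infty & \text{if } f_{\mathcal{H}} \notin \operatorname{Im} L_K^r, \end{cases} \qquad r > 0.$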


Finiteness  of  Cr  is  a  common  source  condition  in  the  inverse  problems  literature  (see 
[12]  for  reference).  This  type  of  condition  has  been  introduced  in  the  statistical  learning 
literature  in  [7,  18,  3,  17,  4]. 

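The constants $D_s$ characterize the marginal distribution $\rho_X$ through the effective dimension $\mathcal{N}(\lambda) = \operatorname{Tr}\left[T (T + \lambda)^{-1}\right]$; they are defined as follows

(5)  $D_s = 1 \vee \sup_{\dot\lambda \in (0,1]} \sqrt{\mathcal{N}(\lambda)\, \dot\lambda^{s}}, \qquad s \in (0,1].$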

Finiteness  of  Ds  was  implicitly  assumed  in  [3,  4]. 
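To give a feeling for the effective dimension, the sketch below evaluates $\mathcal{N}(\lambda) = \sum_i t_i / (t_i + \lambda)$ for a hypothetical spectrum $t_i = i^{-2}$ of $T$. The spectrum, the truncation at $10^5$ eigenvalues and the use of $\operatorname{Tr}[T]$ in place of $\kappa$ are assumptions made only for this toy computation.

```python
import numpy as np

def effective_dimension(eigenvalues, lam):
    """N(lam) = Tr[T (T + lam)^{-1}] = sum_i t_i / (t_i + lam)."""
    t = np.asarray(eigenvalues, dtype=float)
    return float(np.sum(t / (t + lam)))

# Hypothetical spectrum t_i = i^{-2} (polynomial decay), truncated for the sketch.
eigs = 1.0 / np.arange(1, 100001) ** 2.0
kappa = float(eigs.sum())     # Tr[T], used here as a stand-in for kappa

rows = []
for lam in (1e-1, 1e-2, 1e-3):
    n_eff = effective_dimension(eigs, lam)
    # Second column: lam * N(lam), which never exceeds Tr[T].
    # Third column: N(lam) * (lam / kappa)^s with s = 1/2.
    rows.append((lam, lam * n_eff, n_eff * (lam / kappa) ** 0.5))
```

For this spectrum the quantity $\mathcal{N}(\lambda)\,\dot\lambda^{1/2}$ stays bounded as $\lambda$ decreases, which is the kind of behaviour that finiteness of $D_s$ with $s = 1/2$ encodes.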

The paper is organized as follows. In Section 2 we focus on the RLS estimators $f^{ls}_{\tilde z,\lambda}$, defined by the optimization problem

$$f^{ls}_{\tilde z,\lambda} = \operatorname*{argmin}_{f \in \mathcal{H}} \ \frac{1}{\tilde m} \sum_{i=1}^{\tilde m} \left( f(\tilde x_i) - \tilde y_i \right)^2 + \lambda \|f\|_K^2,$$

and corresponding to the choice $G_\lambda(\sigma) = (\sigma + \lambda)^{-1}$ (see for example [5, 7, 18]). The main result of this Section, Theorem 1, extends the convergence analysis performed in [3, 4] from the range $r \ge \frac12$ to arbitrary $r > 0$ and $s \ge \frac12 - r$. Corollary 1 gives optimal $s$-independent rates for $r > 0$.

The analysis of the RLS algorithm is a useful preliminary step for the study of general
regularization methods, which is performed in Section 3. The aim of this Section is
to develop an $s$-dependent analysis in the case $r > 0$ for general regularization methods $G_\lambda$.
In Theorem 2 we extend the results given in Theorem 1 to general regularization methods.
In fact, in Theorem 2 we obtain optimal minimax rates of convergence (see [3, 4]) for the
involved problems, under the assumption that $r + s \ge \frac12$. Finally, Corollary 2 extends
Corollary 1 to general $G_\lambda$.

In  Sections  4  and  5  we  give  the  proofs  of  the  results  stated  in  the  previous  Sections. 


2.  Risk  bounds  for  RLS. 

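We state our main result concerning the convergence of $f^{ls}_{\tilde z,\lambda}$ to $f_{\mathcal{H}}$. The function $|x|_+$, appearing in the text of Theorem 1, is the "positive part" of $x$, that is $\frac{x + |x|}{2}$.

Theorem 1. Let $r$ and $s$ be two reals in the interval $(0, 1]$, fulfilling the constraint $r + s \ge \frac12$. Furthermore, let $m$ and $\lambda$ satisfy the constraints $\lambda \le \|T\|$ and

(6)  $\dot\lambda = \left( \dfrac{4 D_s \log\frac{6}{\delta}}{\sqrt{m}} \right)^{\frac{2}{2r+s}}$

for $\delta \in (0, 1)$. Finally, assume $\tilde m \ge m \dot\lambda^{-|1-2r|_+}$. Then, with probability greater than $1 - \delta$, it holds

$$\left\| f^{ls}_{\tilde z,\lambda} - f_{\mathcal{H}} \right\|_\rho \le 4 (M + C_r) \left( \frac{4 D_s \log\frac{6}{\delta}}{\sqrt{m}} \right)^{\frac{2r}{2r+s}}.$$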
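The quantities appearing in Theorem 1 are explicit, so they can be evaluated directly. The following sketch computes $\dot\lambda$ from eq. (6), the smallest admissible size $m \dot\lambda^{-|1-2r|_+}$ of the complete training set, and the right-hand side of the bound. The numerical values of $m$, $r$, $s$, $D_s$, $\delta$, $M$ and $C_r$ are hypothetical, and the additional constraint $\lambda \le \|T\|$ is not checked here.

```python
import math

def theorem1_parameters(m, r, s, D_s, delta, M, C_r):
    """Evaluate eq. (6), the requirement on m_tilde, and the bound of Theorem 1."""
    base = 4.0 * D_s * math.log(6.0 / delta) / math.sqrt(m)
    lam_dot = base ** (2.0 / (2.0 * r + s))                   # eq. (6)
    m_tilde_min = m * lam_dot ** (-max(1.0 - 2.0 * r, 0.0))   # m * lam_dot^{-|1-2r|_+}
    bound = 4.0 * (M + C_r) * base ** (2.0 * r / (2.0 * r + s))
    return lam_dot, m_tilde_min, bound

# Hypothetical values, chosen only for illustration.
lam_dot, m_tilde_min, bound = theorem1_parameters(
    m=10**6, r=0.25, s=0.5, D_s=1.0, delta=0.05, M=1.0, C_r=1.0)
```

Since $r < \frac12$ in this example, `m_tilde_min` exceeds `m`, so the bound is only guaranteed when unlabelled points are added; for $r \ge \frac12$ the exponent $|1-2r|_+$ vanishes and no unlabelled data are required.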


Some  comments  are  in  order. 

First, while eq. (6) expresses $\lambda$ in terms of $m$ and $\delta$, it is straightforward to verify that the condition $\lambda \le \|T\|$ is satisfied for

$$\sqrt{m} \ge 4 D_s \left( \frac{\kappa}{\|T\|} \right)^{r + \frac{s}{2}} \log\frac{6}{\delta}.$$

Second, the asymptotic rate of convergence $O\left(m^{-\frac{r}{2r+s}}\right)$ of $\| f^{ls}_{\tilde z,\lambda} - f_{\mathcal{H}} \|_\rho$ is optimal in the minimax sense of [11, 4]. Indeed, in Th. 2 of [4], it was shown that this asymptotic order is optimal over the class of probability measures $\rho$ such that $f_{\mathcal{H}} \in \operatorname{Im} L_K^r$ and the eigenvalues $\lambda_i$ of $T$ have asymptotic order $O\left(i^{-1/s}\right)$. In fact, the condition on $f_{\mathcal{H}}$ implies the finiteness of $C_r$ and the condition on the spectrum of $T$ implies the finiteness of $D_s$ (see Prop. 3 in [4]).

Upper bounds of the type given in [17] or [3] (and stated in [6, 4], under a weaker noise condition, and in the more general framework of vector-valued functions) can be obtained as a corollary of Theorem 1, considering the case $r \ge \frac12$.

However, the advantage of using extra unlabelled data is evident when $r < \frac12$. In this case, the unlabelled examples (enforcing the assumption $\tilde m \ge m \dot\lambda^{2r-1}$) again allow (if $s \ge \frac12 - r$) the rate of convergence $O\left(m^{-\frac{r}{2r+s}}\right)$, over classes of measures $\rho$ defined in terms of finiteness of the constants $C_r$ and $D_s$. It is not known to the author whether the same rate of convergence can be achieved by the RLS estimator for $s < \frac12 - r$.

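A simple corollary of Theorem 1, encompassing all the values of $r$ in $(0, 1]$, can be obtained observing that $D_1 = 1$ for every kernel $K$ and marginal distribution $\rho_X$ (see Prop. 2).

Corollary 1. Let $\tilde m \ge m \dot\lambda^{-|1-2r|_+}$ hold with $r$ in the interval $(0, 1]$. If $\lambda$ satisfies the constraints $\lambda \le \|T\|$ and

$$\dot\lambda = \left( \frac{4 \log\frac{6}{\delta}}{\sqrt{m}} \right)^{\frac{2}{2r+1}}$$

for $\delta \in (0, 1)$, then, with probability greater than $1 - \delta$, it holds

$$\left\| f^{ls}_{\tilde z,\lambda} - f_{\mathcal{H}} \right\|_\rho \le 4 (M + C_r) \left( \frac{4 \log\frac{6}{\delta}}{\sqrt{m}} \right)^{\frac{2r}{2r+1}}.$$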


3.  Risk  bounds  for  general  regularization  methods. 


In this Section we state a result which generalizes Theorem 1 from RLS to general regularization algorithms of the type described by equation (1). In this general framework we need $(\dot\lambda^{-|2-2r-s|_+} - 1)\, m$ unlabelled examples in order to get minimax optimal rates, slightly more than the $(\dot\lambda^{-|1-2r|_+} - 1)\, m$ required in Theorem 1 for the RLS estimator. We adopt the same notations and definitions introduced in the previous section.

Theorem 2. Let $r > 0$ and $s \in (0, 1]$ fulfill the constraint $r + s \ge \frac12$. Furthermore, let $m$ and $\lambda$ satisfy the constraints $\lambda \le \|T\|$ and

(7)  $\dot\lambda = \left( \dfrac{4 D_s \log\frac{6}{\delta}}{\sqrt{m}} \right)^{\frac{2}{2r+s}}$

for $\delta \in (0, \frac13)$. Finally, assume $\tilde m \ge 4 \vee m \dot\lambda^{-|2-2r-s|_+}$. Then, with probability greater than $1 - 3\delta$, it holds

$$\left\| f_{\tilde z,\lambda} - f_{\mathcal{H}} \right\|_\rho \le E_r \left( \frac{4 D_s \log\frac{6}{\delta}}{\sqrt{m}} \right)^{\frac{2r}{2r+s}},$$

where

(8)  $E_r = C_r \left( 30 A + 2 (3 + r) B_r + 1 \right) + 9 M A.$

The  proof  of  the  above  Theorem  is  postponed  to  Section  5. 

For the particular case $G_\lambda(\sigma) = (\sigma + \lambda)^{-1}$, $f_{\tilde z,\lambda} = f^{ls}_{\tilde z,\lambda}$ and the result above can be
compared with Theorem 1. In this case, it is easy to verify that $A = 1$, $B_r \le 1$ for $r \in [0, 1]$
and $B_r = +\infty$ for $r > 1$. The maximal value of $r$ for which $B_r < +\infty$ is usually denoted
as the qualification of the regularization method.
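The values of the constants for the RLS filter can be checked numerically. The sketch below approximates the suprema in definitions (2) and (3) on a finite grid for $G_\lambda(\sigma) = (\sigma + \lambda)^{-1}$; the choice $\kappa = 1$ and the grids are assumptions, so the output is only a grid approximation of $A$ and $B_r$, not an exact evaluation.

```python
import numpy as np

kappa = 1.0
lams = np.logspace(-6, 0, 61) * kappa        # grid over (0, kappa]
sigmas = np.linspace(0.0, kappa, 2001)       # grid over [0, kappa]
L, S = np.meshgrid(lams, sigmas, indexing="ij")
G = 1.0 / (S + L)                            # Tikhonov filter G_lam(sigma)

# Definition (2): sup of |(sigma + lam) G_lam(sigma)|.
A_grid = float(np.max(np.abs((S + L) * G)))

def B_grid(r, n_t=21):
    """Grid approximation of B_r from definition (3), with the convention 0^0 = 1."""
    best = 0.0
    for t in np.linspace(0.0, r, n_t):
        vals = np.abs(1.0 - G * S) * S ** t * L ** (-t)
        best = max(best, float(vals.max()))
    return best

B_one = B_grid(1.0)     # does not exceed 1 on the grid
B_two = B_grid(2.0)     # grows without bound as the lambda grid approaches 0
```

On this grid `B_two` is of the order of the inverse of the smallest $\lambda$ considered, which illustrates why the qualification of the RLS filter is 1.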

For a description of the properties of common regularization methods in the inverse
problems literature, we refer to [12]. In the context of learning theory, a review of these
techniques can be found in [10] and [1]. In particular, some convergence results for
algorithms induced by Lipschitz continuous $G_\lambda$ can be found in [10].

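A simple corollary of Theorem 2, which generalizes Corollary 1 to arbitrary regularization methods, can be obtained observing that $D_1 = 1$ for every kernel $K$ and marginal distribution $\rho_X$ (see Prop. 2).

Corollary 2. Let $\tilde m \ge 4 \vee m \dot\lambda^{-|1-2r|_+}$ hold with $r > 0$. If $\lambda$ satisfies the constraints $\lambda \le \|T\|$ and

$$\dot\lambda = \left( \frac{4 \log\frac{6}{\delta}}{\sqrt{m}} \right)^{\frac{2}{2r+1}}$$

for some $\delta \in (0, \frac13)$, then, with probability greater than $1 - 3\delta$, it holds

$$\left\| f_{\tilde z,\lambda} - f_{\mathcal{H}} \right\|_\rho \le E_r \left( \frac{4 \log\frac{6}{\delta}}{\sqrt{m}} \right)^{\frac{2r}{2r+1}},$$

with $E_r$ defined by eq. (8).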


4.  Proof  of  Theorem  1 

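In this section we give the proof of Theorem 1. First we need some preliminary propositions.

Proposition 1. Assume $\lambda \le \|T\|$ and

(9)  $\lambda \tilde m \ge 16\, \kappa\, \mathcal{N}(\lambda) \log^2\frac{6}{\delta}$

for some $\delta \in (0, 1)$. Then, with probability greater than $1 - \delta$, it holds

$$\left\| (T + \lambda)^{\frac12} \left( f^{ls}_{\tilde z,\lambda} - f^{ls}_\lambda \right) \right\|_{\mathcal{H}} \le 8 \left( M + \sqrt{\frac{\kappa m}{\tilde m}}\, \| f^{ls}_\lambda \|_{\mathcal{H}} \right) \left( \frac{2}{m} \sqrt{\frac{\kappa}{\lambda}} + \sqrt{\frac{\mathcal{N}(\lambda)}{m}} \right) \log\frac{6}{\delta},$$

where

$$f^{ls}_\lambda = (T + \lambda)^{-1} L_K f_{\mathcal{H}}.$$

Proof. Assuming

(10)  $S_1 := \left\| (T + \lambda)^{-\frac12} (T - T_{\tilde x}) (T + \lambda)^{-\frac12} \right\|_{HS} < 1,$

by simple algebraic computations we obtain (with $g = L_K f_{\mathcal{H}}$)

$$\begin{aligned}
f^{ls}_{\tilde z,\lambda} - f^{ls}_\lambda &= (T_{\tilde x} + \lambda)^{-1} g_{\tilde z} - (T + \lambda)^{-1} g \\
&= (T_{\tilde x} + \lambda)^{-1} \left\{ (g_{\tilde z} - g) + (T - T_{\tilde x}) (T + \lambda)^{-1} g \right\} \\
&= (T_{\tilde x} + \lambda)^{-1} (T + \lambda)^{\frac12} \left\{ (T + \lambda)^{-\frac12} (g_{\tilde z} - g) + (T + \lambda)^{-\frac12} (T - T_{\tilde x}) (T + \lambda)^{-1} g \right\} \\
&= (T + \lambda)^{-\frac12} \left\{ \mathrm{Id} - (T + \lambda)^{-\frac12} (T - T_{\tilde x}) (T + \lambda)^{-\frac12} \right\}^{-1} \left\{ (T + \lambda)^{-\frac12} (g_{\tilde z} - g) + (T + \lambda)^{-\frac12} (T - T_{\tilde x}) f^{ls}_\lambda \right\}.
\end{aligned}$$

Therefore we get

$$\left\| (T + \lambda)^{\frac12} \left( f^{ls}_{\tilde z,\lambda} - f^{ls}_\lambda \right) \right\|_{\mathcal{H}} \le \frac{S_2 + S_3}{1 - S_1},$$

where

$$S_2 := \left\| (T + \lambda)^{-\frac12} (g_{\tilde z} - g) \right\|_{\mathcal{H}}, \qquad S_3 := \left\| (T + \lambda)^{-\frac12} (T - T_{\tilde x}) f^{ls}_\lambda \right\|_{\mathcal{H}}.$$

Now we want to estimate the quantities $S_1$, $S_2$ and $S_3$ using Prop. 4. In fact, choosing the correct vector-valued random variables $\xi_1$, $\xi_2$ and $\xi_3$, the following common representation holds,

$$S_h = \left\| \frac{1}{m_h} \sum_{i=1}^{m_h} \xi_h(\omega_i) - \mathbb{E}[\xi_h] \right\|, \qquad h = 1, 2, 3.$$

Indeed, in order to let the equality above hold, $\xi_1 : X \to \mathcal{L}_{HS}(\mathcal{H})$ is defined by

$$\xi_1(x)[\cdot] = (T + \lambda)^{-\frac12} K_x \langle K_x, \cdot \rangle_{\mathcal{H}} (T + \lambda)^{-\frac12},$$

and $m_1 = \tilde m$.

Moreover, $\xi_2 : Z \to \mathcal{H}$ is defined by

$$\xi_2(x, y) = (T + \lambda)^{-\frac12} K_x\, y,$$

with $m_2 = m$.

And finally, $\xi_3 : X \to \mathcal{H}$ is defined by

$$\xi_3(x) = (T + \lambda)^{-\frac12} K_x f^{ls}_\lambda(x),$$

with $m_3 = \tilde m$.

Hence, applying Prop. 4 three times, each time with $\delta/3$ in place of $\delta$, we can write

$$S_h \le 2 \left( \frac{H_h}{m_h} + \frac{\sigma_h}{\sqrt{m_h}} \right) \log\frac{6}{\delta}, \qquad h = 1, 2, 3,$$

where, as it can be straightforwardly verified, the constants $H_h$ and $\sigma_h$ are given by the expressions

$$\begin{aligned}
H_1 &= \frac{2\kappa}{\lambda}, & \sigma_1^2 &= \frac{\kappa}{\lambda}\, \mathcal{N}(\lambda), \\
H_2 &= 2 M \sqrt{\frac{\kappa}{\lambda}}, & \sigma_2^2 &= M^2 \mathcal{N}(\lambda), \\
H_3 &= 2 \| f^{ls}_\lambda \|_{\mathcal{H}} \frac{\kappa}{\sqrt{\lambda}}, & \sigma_3^2 &= \kappa \| f^{ls}_\lambda \|_{\mathcal{H}}^2\, \mathcal{N}(\lambda).
\end{aligned}$$

Now, recalling the assumptions on $\lambda$ and $\tilde m$, with probability greater than $1 - \delta/3$, we get

$$S_1 \le 2 \left( \frac{2\kappa}{\tilde m \lambda} + \sqrt{\frac{\kappa\, \mathcal{N}(\lambda)}{\tilde m \lambda}} \right) \log\frac{6}{\delta} \le \frac{4 \kappa\, \mathcal{N}(\lambda) \log^2\frac{6}{\delta}}{\lambda \tilde m} + \sqrt{\frac{4 \kappa\, \mathcal{N}(\lambda) \log^2\frac{6}{\delta}}{\lambda \tilde m}} \le \frac14 + \frac12 = \frac34,$$

where we used $\mathcal{N}(\lambda) \ge \frac{\|T\|}{\|T\| + \lambda} \ge \frac12$ and eq. (9).

Hence, since $\tilde m \ge m$, with probability greater than $1 - \delta$,

$$\left\| (T + \lambda)^{\frac12} \left( f^{ls}_{\tilde z,\lambda} - f^{ls}_\lambda \right) \right\|_{\mathcal{H}} \le 4 (S_2 + S_3) \le 8 \left( M + \sqrt{\frac{\kappa m}{\tilde m}}\, \| f^{ls}_\lambda \|_{\mathcal{H}} \right) \left( \frac{2}{m} \sqrt{\frac{\kappa}{\lambda}} + \sqrt{\frac{\mathcal{N}(\lambda)}{m}} \right) \log\frac{6}{\delta}.$$

□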


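Proposition 2. For every probability measure $\rho_X$ and $\lambda > 0$, it holds

$$\|T\| \le \kappa,$$

and

$$\lambda\, \mathcal{N}(\lambda) \le \kappa.$$

Proof. First, observe that

$$\operatorname{Tr}[T] = \int_X \operatorname{Tr}\left[ K_x \langle K_x, \cdot \rangle_{\mathcal{H}} \right] d\rho_X(x) = \int_X K(x, x)\, d\rho_X(x) \le \sup_{x \in X} K(x, x) \le \kappa.$$

Therefore, since $T$ is a positive self-adjoint operator, the first inequality follows observing that

$$\|T\| \le \operatorname{Tr}[T] \le \kappa.$$

The second inequality can be proved observing that, since $\psi_\lambda(\sigma) := \frac{\lambda \sigma}{\sigma + \lambda} \le \sigma$, it holds

$$\lambda\, \mathcal{N}(\lambda) = \operatorname{Tr}[\psi_\lambda(T)] \le \operatorname{Tr}[T] \le \kappa.$$

□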


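Proposition 3. Let $f_{\mathcal{H}} \in \operatorname{Im} L_K^r$ for some $r > 0$. Then, the following estimates hold,

$$\left\| f^{ls}_\lambda - f_{\mathcal{H}} \right\|_\rho \le \lambda^r \left\| L_K^{-r} f_{\mathcal{H}} \right\|_\rho, \qquad \text{if } r \le 1,$$

$$\left\| f^{ls}_\lambda \right\|_{\mathcal{H}} \le \begin{cases} \lambda^{-\frac12 + r} \left\| L_K^{-r} f_{\mathcal{H}} \right\|_\rho & \text{if } r \le \frac12, \\ \kappa^{-\frac12 + r} \left\| L_K^{-r} f_{\mathcal{H}} \right\|_\rho & \text{if } r > \frac12. \end{cases}$$

Proof. The first estimate is standard in the theory of inverse problems, see, for example, [14, 12] or [18].

Regarding the second estimate, if $r \le \frac12$, since $T$ is positive, we can write

$$\left\| f^{ls}_\lambda \right\|_{\mathcal{H}} \le \left\| (T + \lambda)^{-1} L_K f_{\mathcal{H}} \right\|_{\mathcal{H}} \le \left\| (T + \lambda)^{-\frac12 + r} \left( T (T + \lambda)^{-1} \right)^{\frac12 + r} \right\| \left\| L_K^{-r} f_{\mathcal{H}} \right\|_\rho \le \left\| (T + \lambda)^{-1} \right\|^{\frac12 - r} \left\| L_K^{-r} f_{\mathcal{H}} \right\|_\rho \le \lambda^{-\frac12 + r} \left\| L_K^{-r} f_{\mathcal{H}} \right\|_\rho.$$

On the contrary, if $r > \frac12$, since by Prop. 2 $\|T\| \le \kappa$, we obtain

$$\left\| f^{ls}_\lambda \right\|_{\mathcal{H}} \le \left\| (T + \lambda)^{-1} L_K f_{\mathcal{H}} \right\|_{\mathcal{H}} \le \left\| T^{r - \frac12}\, T (T + \lambda)^{-1} \right\| \left\| L_K^{-r} f_{\mathcal{H}} \right\|_\rho \le \|T\|^{r - \frac12} \left\| L_K^{-r} f_{\mathcal{H}} \right\|_\rho \le \kappa^{r - \frac12} \left\| L_K^{-r} f_{\mathcal{H}} \right\|_\rho.$$

□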


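We also need the following probabilistic inequality based on a result of [16], see also Th. 3.3.4 of [19]. We report it without proof.

Proposition 4. Let $(\Omega, \mathcal{F}, P)$ be a probability space and $\xi$ be a random variable on $\Omega$ taking value in a real separable Hilbert space $\mathcal{K}$. Assume that there are two positive constants $H$ and $\sigma$ such that

$$\| \xi(\omega) \|_{\mathcal{K}} \le \frac{H}{2} \quad \text{a.s.},$$

$$\mathbb{E}\left[ \| \xi \|_{\mathcal{K}}^2 \right] \le \sigma^2,$$

then, for all $m \in \mathbb{N}$ and $0 < \delta < 1$,

(11)  $\mathbb{P}_{(\omega_1, \dots, \omega_m) \sim P^m} \left[ \left\| \dfrac{1}{m} \sum_{i=1}^{m} \xi(\omega_i) - \mathbb{E}[\xi] \right\|_{\mathcal{K}} \le 2 \left( \dfrac{H}{m} + \dfrac{\sigma}{\sqrt{m}} \right) \log\dfrac{2}{\delta} \right] \ge 1 - \delta.$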
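Inequality (11) can be probed by simulation in a finite-dimensional Hilbert space. The sketch below draws $\xi$ uniformly on the unit sphere of $\mathbb{R}^3$, so that $\mathbb{E}[\xi] = 0$, $\|\xi\| = 1 \le H/2$ with $H = 2$, and $\mathbb{E}\|\xi\|^2 = 1 \le \sigma^2$ with $\sigma = 1$. The distribution, the dimension and the sample sizes are assumptions made for this illustration; a simulation of this kind is a sanity check, not a proof.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, trials, delta = 3, 200, 2000, 0.1
H, sigma = 2.0, 1.0     # ||xi|| = 1 <= H / 2 and E||xi||^2 = 1 <= sigma^2

# xi uniform on the unit sphere of R^d, so that E[xi] = 0.
xi = rng.normal(size=(trials, m, d))
xi /= np.linalg.norm(xi, axis=-1, keepdims=True)

# Norm of the empirical mean minus the true mean (zero), one value per trial.
deviation = np.linalg.norm(xi.mean(axis=1), axis=-1)
threshold = 2.0 * (H / m + sigma / np.sqrt(m)) * np.log(2.0 / delta)
failure_rate = float(np.mean(deviation > threshold))
```

Proposition 4 guarantees that the deviation exceeds the threshold with probability at most $\delta$; in this example the observed `failure_rate` should therefore not exceed `delta`, and in practice the bound is far from tight for such a well-behaved distribution.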


We  are  finally  ready  to  prove  Theorem  1. 


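Proof of Theorem 1. The Theorem is a corollary of Prop. 1. We proceed by steps.

First. Observe that, by Prop. 2, it holds

$$\dot\lambda \le \frac{\|T\|}{\kappa} \le 1.$$

Second. Condition (9) holds. In fact, since $\dot\lambda \le 1$ and by the assumption $\tilde m \ge m \dot\lambda^{-|1-2r|_+}$, we get

$$\dot\lambda \tilde m \ge \dot\lambda^{-|1-2r|_+ + 1}\, m \ge \dot\lambda^{2r} m.$$

Moreover, by eq. (6) and definition (5), we find

$$\dot\lambda^{2r} m = 16 D_s^2\, \dot\lambda^{-s} \log^2\frac{6}{\delta} \ge 16\, \mathcal{N}(\lambda) \log^2\frac{6}{\delta}.$$

Third. Since $\dot\lambda \le 1$, recalling definition (4) and Prop. 3, for every $r$ in $(0, 1]$, we can write

$$\left\| f^{ls}_\lambda - f_{\mathcal{H}} \right\|_\rho \le \dot\lambda^r C_r,$$

$$\sqrt{\kappa}\, \left\| f^{ls}_\lambda \right\|_{\mathcal{H}} \le \dot\lambda^{-|\frac12 - r|_+} C_r.$$

Therefore we can apply Prop. 1 and, using the two estimates above, the assumption $\tilde m \ge m \dot\lambda^{-|1-2r|_+}$ and the definition of $D_s$, obtain the following bound (recall that $\|f\|_\rho = \|\sqrt{T} f\|_{\mathcal{H}}$),

$$\begin{aligned}
\left\| f^{ls}_{\tilde z,\lambda} - f_{\mathcal{H}} \right\|_\rho &\le \left\| f^{ls}_{\tilde z,\lambda} - f^{ls}_\lambda \right\|_\rho + \left\| f^{ls}_\lambda - f_{\mathcal{H}} \right\|_\rho \\
&\le \left\| (T + \lambda)^{\frac12} \left( f^{ls}_{\tilde z,\lambda} - f^{ls}_\lambda \right) \right\|_{\mathcal{H}} + \left\| f^{ls}_\lambda - f_{\mathcal{H}} \right\|_\rho \\
&\le 8 (M + C_r) \frac{1}{\sqrt{m}} \left( \frac{2}{\sqrt{m \dot\lambda}} + \frac{D_s}{\sqrt{\dot\lambda^{s}}} \right) \log\frac{6}{\delta} + \dot\lambda^r C_r \\
&= 2 (M + C_r)\, \dot\lambda^r \left( 1 + \frac{\dot\lambda^{r + s - \frac12}}{2 D_s^2 \log\frac{6}{\delta}} \right) + \dot\lambda^r C_r && \text{(eq. (6))} \\
&\le \left( 3 (M + C_r) + C_r \right) \dot\lambda^r \le 4 (M + C_r)\, \dot\lambda^r && (r + s \ge \tfrac12).
\end{aligned}$$

Substituting the expression (6) for $\dot\lambda$ in the inequality above concludes the proof. □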

5.  Proof  of  Theorem  2 

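In this section we give the proof of Theorem 2. It is based on Proposition 1, which establishes an upper bound on the sample error for the RLS algorithm in terms of the constants $C_r$ and $D_s$. We need some preliminary results. Proposition 5 shows properties of the truncated functions $f^{tr}_\lambda$, defined by equation (12), analogous to those given in Proposition 3 for the functions $f^{ls}_\lambda$.

Proposition 5. Let $f_{\mathcal{H}} \in \operatorname{Im} L_K^r$ for some $r > 0$. For any $\lambda > 0$ let the truncated function $f^{tr}_\lambda$ be defined by

(12)  $f^{tr}_\lambda = P_\lambda f_{\mathcal{H}},$

where $P_\lambda$ is the orthogonal projector in $L^2(X, \rho_X)$ defined by

(13)  $P_\lambda = \Theta_\lambda(L_K),$

with

(14)  $\Theta_\lambda(\sigma) = \begin{cases} 1 & \text{if } \sigma \ge \lambda, \\ 0 & \text{if } \sigma < \lambda. \end{cases}$

Then, the following estimates hold,

$$\left\| f^{tr}_\lambda - f_{\mathcal{H}} \right\|_\rho \le \lambda^r \left\| L_K^{-r} f_{\mathcal{H}} \right\|_\rho,$$

$$\left\| f^{tr}_\lambda \right\|_{\mathcal{H}} \le \begin{cases} \lambda^{-\frac12 + r} \left\| L_K^{-r} f_{\mathcal{H}} \right\|_\rho & \text{if } r \le \frac12, \\ \kappa^{-\frac12 + r} \left\| L_K^{-r} f_{\mathcal{H}} \right\|_\rho & \text{if } r > \frac12. \end{cases}$$

Proof. The first estimate follows simply observing that

$$\left\| f^{tr}_\lambda - f_{\mathcal{H}} \right\|_\rho = \left\| P^\perp_\lambda f_{\mathcal{H}} \right\|_\rho \le \left\| P^\perp_\lambda L_K^r \right\| \left\| L_K^{-r} f_{\mathcal{H}} \right\|_\rho \le \lambda^r \left\| L_K^{-r} f_{\mathcal{H}} \right\|_\rho,$$

where we introduced the orthogonal projector $P^\perp_\lambda = \mathrm{Id} - P_\lambda$.

Now let us consider the second estimate. Firstly observe that, since the compact operators $L_K$ and $T$ have a common eigensystem of functions on $X$, $P_\lambda$ can also be seen as an orthogonal projector in $\mathcal{H}$, and $f^{tr}_\lambda \in \mathcal{H}$. Hence we can write

$$\left\| f^{tr}_\lambda \right\|_{\mathcal{H}} = \left\| P_\lambda f_{\mathcal{H}} \right\|_{\mathcal{H}} \le \left\| L_K^{-\frac12} P_\lambda f_{\mathcal{H}} \right\|_\rho \le \left\| L_K^{-\frac12 + r}\, \Theta_\lambda(L_K) \right\| \left\| L_K^{-r} f_{\mathcal{H}} \right\|_\rho.$$

The proof is concluded observing that by Prop. 2, $\|L_K\| = \|T\| \le \kappa$, and that, for every $\sigma \in [0, \kappa]$, it holds

$$\sigma^{-\frac12 + r}\, \Theta_\lambda(\sigma) \le \begin{cases} \lambda^{-\frac12 + r} & \text{if } r \le \frac12, \\ \kappa^{-\frac12 + r} & \text{if } r > \frac12. \end{cases}$$

□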


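Proposition 6 below estimates one of the terms appearing in the proof of Theorem 2 for any $r > 0$. The case $r \ge \frac12$ had already been analyzed in the proof of Theorem 7 in [1].

Proposition 6. Let $r > 0$ and define

(15)  $\gamma = \lambda^{-1} \| T - T_{\tilde x} \|.$

Then, if $\lambda \in (0, \kappa]$, it holds

$$\left\| \sqrt{T} \left( G_\lambda(T_{\tilde x}) T_{\tilde x} - \mathrm{Id} \right) f^{tr}_\lambda \right\|_{\mathcal{H}} \le B_r C_r (1 + \sqrt{\gamma}) \left( 2 + r \gamma \dot\lambda^{\frac32 - r} + \gamma^\eta \right) \dot\lambda^r,$$

where

$$\eta = \left| r - \tfrac12 \right| - \left\lfloor \left| r - \tfrac12 \right| \right\rfloor.$$

Proof. The two inequalities (16) and (17) will be useful in the proof. The first follows from Theorem 1 in [15],

(16)  $\left\| T^\alpha - T_{\tilde x}^\alpha \right\| \le \left\| T - T_{\tilde x} \right\|^\alpha, \qquad \alpha \in [0, 1],$

where we adopted the convention $0^0 = 1$. The second is a corollary of Theorem 8.1 in [2],

(17)  $\left\| T^p - T_{\tilde x}^p \right\| \le p\, \kappa^{p-1} \left\| T - T_{\tilde x} \right\|, \qquad p \in \mathbb{N}.$

We also need to introduce the orthogonal projector in $\mathcal{H}$, $P_{\tilde x,\lambda}$, defined by

$$P_{\tilde x,\lambda} = \Theta_\lambda(T_{\tilde x}),$$

with $\Theta_\lambda$ defined in (14).

We analyze the cases $r < \frac12$ and $r \ge \frac12$ separately.

Case $r < \frac12$: In the three steps below we subsequently estimate the norms of the three terms of the expansion

(18)  $\sqrt{T} \left( G_\lambda(T_{\tilde x}) T_{\tilde x} - \mathrm{Id} \right) f^{tr}_\lambda = \sqrt{T} P^\perp_{\tilde x,\lambda}\, r_\lambda(T_{\tilde x}) f^{tr}_\lambda + P_{\tilde x,\lambda}\, r_\lambda(T_{\tilde x})\, T_{\tilde x}^{\frac12} f^{tr}_\lambda + \left( \sqrt{T} - \sqrt{T_{\tilde x}} \right) P_{\tilde x,\lambda}\, r_\lambda(T_{\tilde x}) f^{tr}_\lambda,$

where $P^\perp_{\tilde x,\lambda} = \mathrm{Id} - P_{\tilde x,\lambda}$ and $r_\lambda(\sigma) = \sigma G_\lambda(\sigma) - 1$.

Step 1: Observe that

$$\left\| \sqrt{T} P^\perp_{\tilde x,\lambda} \right\|^2 = \left\| P^\perp_{\tilde x,\lambda}\, T\, P^\perp_{\tilde x,\lambda} \right\| \le \sup_{\phi \in \operatorname{Im} P^\perp_{\tilde x,\lambda}} \frac{\langle \phi, T \phi \rangle_{\mathcal{H}}}{\|\phi\|_{\mathcal{H}}^2} \le \sup_{\phi \in \operatorname{Im} P^\perp_{\tilde x,\lambda}} \frac{\langle \phi, T_{\tilde x} \phi \rangle_{\mathcal{H}}}{\|\phi\|_{\mathcal{H}}^2} + \sup_{\phi \in \mathcal{H}} \frac{\langle \phi, (T - T_{\tilde x}) \phi \rangle_{\mathcal{H}}}{\|\phi\|_{\mathcal{H}}^2} \le \lambda + \| T_{\tilde x} - T \| = \lambda (1 + \gamma).$$

Therefore, from definitions (3) and (4) and Proposition 5, it follows

$$\left\| \sqrt{T} P^\perp_{\tilde x,\lambda}\, r_\lambda(T_{\tilde x}) f^{tr}_\lambda \right\|_{\mathcal{H}} \le \left\| \sqrt{T} P^\perp_{\tilde x,\lambda} \right\| \left\| r_\lambda(T_{\tilde x}) \right\| \left\| f^{tr}_\lambda \right\|_{\mathcal{H}} \le B_r C_r \sqrt{1 + \gamma}\; \dot\lambda^r.$$

Step 2: Observe that, from inequality (16), definition (4) and Proposition 5,

(19)  $\left\| T_{\tilde x}^{\frac12 - r} f^{tr}_\lambda \right\|_{\mathcal{H}} \le \left\| \Theta_\lambda(T) \right\| \left\| T^{\frac12 - r} f_{\mathcal{H}} \right\|_{\mathcal{H}} + \left\| T^{\frac12 - r} - T_{\tilde x}^{\frac12 - r} \right\| \left\| f^{tr}_\lambda \right\|_{\mathcal{H}} \le \kappa^{-r} C_r \left( 1 + \gamma^{\frac12 - r} \right).$

Therefore from definition (3), it follows

$$\left\| P_{\tilde x,\lambda}\, r_\lambda(T_{\tilde x})\, T_{\tilde x}^{\frac12} f^{tr}_\lambda \right\|_{\mathcal{H}} \le B_r C_r \left( 1 + \gamma^{\frac12 - r} \right) \dot\lambda^r.$$

Step 3: Recalling the definition of $P_{\tilde x,\lambda}$, and applying again inequality (19) and Proposition 5, we get

$$\left\| \left( \sqrt{T} - \sqrt{T_{\tilde x}} \right) P_{\tilde x,\lambda}\, r_\lambda(T_{\tilde x}) f^{tr}_\lambda \right\|_{\mathcal{H}} \le \left\| \sqrt{T} - \sqrt{T_{\tilde x}} \right\| \left\| T_{\tilde x}^{-\frac12 + r} P_{\tilde x,\lambda} \right\| \left\| r_\lambda(T_{\tilde x}) \right\| \left\| T_{\tilde x}^{\frac12 - r} f^{tr}_\lambda \right\|_{\mathcal{H}} \le B_r C_r\, \gamma^{\frac12} \left( 1 + \gamma^{\frac12 - r} \right) \dot\lambda^r.$$

Since we assumed $0 < r < \frac12$, and therefore $\eta = \frac12 - r$, the three estimates above prove the statement of the Proposition in this case.

Case $r \ge \frac12$: Consider the expansion

$$\begin{aligned}
\left( G_\lambda(T_{\tilde x}) T_{\tilde x} - \mathrm{Id} \right) f^{tr}_\lambda &= r_\lambda(T_{\tilde x})\, T^{r - \frac12} v \\
&= r_\lambda(T_{\tilde x})\, T_{\tilde x}^{r - \frac12} v + r_\lambda(T_{\tilde x}) \left( T^{r - \frac12} - T_{\tilde x}^{r - \frac12} \right) v \\
&= r_\lambda(T_{\tilde x})\, T_{\tilde x}^{r - \frac12} v + r_\lambda(T_{\tilde x})\, T_{\tilde x}^{p} \left( T^{r - \frac12 - p} - T_{\tilde x}^{r - \frac12 - p} \right) v + r_\lambda(T_{\tilde x}) \left( T^{p} - T_{\tilde x}^{p} \right) T^{r - \frac12 - p} v,
\end{aligned}$$

where $v = P_\lambda T^{\frac12 - r} f_{\mathcal{H}}$, $r_\lambda(\sigma) = \sigma G_\lambda(\sigma) - 1$ and $p = \lfloor r - \frac12 \rfloor$.

Now, for any $\beta \in [0, \frac12]$, from the expansion above, using inequalities (16) and (17) and definition (3), we get

(20)  $\begin{aligned}
\left\| T_{\tilde x}^{\beta} \left( G_\lambda(T_{\tilde x}) T_{\tilde x} - \mathrm{Id} \right) f^{tr}_\lambda \right\|_{\mathcal{H}} &\le \left\| r_\lambda(T_{\tilde x})\, T_{\tilde x}^{r - \frac12 + \beta} \right\| \|v\|_{\mathcal{H}} + \left\| r_\lambda(T_{\tilde x})\, T_{\tilde x}^{p + \beta} \right\| \left\| T^{r - \frac12 - p} - T_{\tilde x}^{r - \frac12 - p} \right\| \|v\|_{\mathcal{H}} \\
&\quad + \left\| r_\lambda(T_{\tilde x})\, T_{\tilde x}^{\beta} \right\| \left\| T^{p} - T_{\tilde x}^{p} \right\| \left\| T^{r - \frac12 - p} \right\| \|v\|_{\mathcal{H}} \\
&\le B_r C_r\, \kappa^{-\frac12 + \beta} \left( \dot\lambda^{r - \frac12 + \beta} \left( 1 + \gamma^{r - \frac12 - p} \right) + p\, \dot\lambda^{1 + \beta} \gamma \right) \\
&\le B_r C_r\, \kappa^{-\frac12 + \beta} \left( \dot\lambda^{-\frac12 + \beta} (1 + \gamma^\eta) + r \gamma\, \dot\lambda^{1 + \beta - r} \right) \dot\lambda^r.
\end{aligned}$

Finally, from the expansion

$$\left\| \sqrt{T} \left( G_\lambda(T_{\tilde x}) T_{\tilde x} - \mathrm{Id} \right) f^{tr}_\lambda \right\|_{\mathcal{H}} \le \left\| \sqrt{T} - \sqrt{T_{\tilde x}} \right\| \left\| r_\lambda(T_{\tilde x}) f^{tr}_\lambda \right\|_{\mathcal{H}} + \left\| \sqrt{T_{\tilde x}}\, r_\lambda(T_{\tilde x}) f^{tr}_\lambda \right\|_{\mathcal{H}},$$

using (16) and inequality (20) with $\beta = 0$ and $\beta = \frac12$, we get the claimed result also in this case. □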

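We need an additional preliminary result.

Proposition 7. Let the operator $\Omega_\lambda$ be defined by

(21)  $\Omega_\lambda = \sqrt{T}\, G_\lambda(T_{\tilde x}) (T_{\tilde x} + \lambda) (T + \lambda)^{-\frac12}.$

Then, if $\lambda \in (0, \kappa]$, it holds

$$\| \Omega_\lambda \| \le (1 + 2 \sqrt{\gamma})\, A,$$

with $\gamma$ defined in eq. (15).

Proof. First consider the expansion

$$\Omega_\lambda = \left( \sqrt{T} - \sqrt{T_{\tilde x}} \right) \Delta_\lambda (T + \lambda)^{-\frac12} - \Delta_\lambda \left( \sqrt{T} - \sqrt{T_{\tilde x}} \right) (T + \lambda)^{-\frac12} + \Delta_\lambda \sqrt{T} (T + \lambda)^{-\frac12},$$

where we introduced the operator

$$\Delta_\lambda = G_\lambda(T_{\tilde x}) (T_{\tilde x} + \lambda).$$

By condition (2), it follows $\| \Delta_\lambda \| \le A$. Moreover, from inequality (16),

(22)  $\left\| \sqrt{T} - \sqrt{T_{\tilde x}} \right\| \le \sqrt{\| T - T_{\tilde x} \|}.$

From the previous observations we easily get

$$\| \Omega_\lambda \| \le 2 \| \Delta_\lambda \| \left\| \sqrt{T} - \sqrt{T_{\tilde x}} \right\| \left\| (T + \lambda)^{-\frac12} \right\| + \| \Delta_\lambda \| \left\| \sqrt{T} (T + \lambda)^{-\frac12} \right\| \le A (1 + 2 \sqrt{\gamma}),$$

the claimed result.

□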

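We are now ready to show the proof of Theorem 2.

Proof of Theorem 2. We consider the expansion

$$\begin{aligned}
\sqrt{T} \left( f_{\tilde z,\lambda} - f_{\mathcal{H}} \right) &= \sqrt{T} \left( G_\lambda(T_{\tilde x})\, g_{\tilde z} - f^{tr}_\lambda \right) + \sqrt{T} \left( f^{tr}_\lambda - f_{\mathcal{H}} \right) \\
&= \Omega_\lambda (T + \lambda)^{\frac12} \left( f^{ls}_{\tilde z,\lambda} - f^{ls}_{\tilde z',\lambda} \right) + \sqrt{T} \left( G_\lambda(T_{\tilde x}) T_{\tilde x} - \mathrm{Id} \right) f^{tr}_\lambda + \sqrt{T} \left( f^{tr}_\lambda - f_{\mathcal{H}} \right) \\
&= \Omega_\lambda \left( (T + \lambda)^{\frac12} \left( f^{ls}_{\tilde z,\lambda} - f^{ls}_\lambda \right) + (T + \lambda)^{\frac12} \left( f^{ls}_\lambda - \bar f^{ls}_\lambda \right) + (T + \lambda)^{\frac12} \left( \bar f^{ls}_\lambda - f^{ls}_{\tilde z',\lambda} \right) \right) \\
&\quad + \sqrt{T} \left( G_\lambda(T_{\tilde x}) T_{\tilde x} - \mathrm{Id} \right) f^{tr}_\lambda + \sqrt{T} \left( f^{tr}_\lambda - f_{\mathcal{H}} \right),
\end{aligned}$$

where the operator $\Omega_\lambda$ is defined by equation (21), the ideal RLS estimators are $f^{ls}_\lambda = (T + \lambda)^{-1} T f_{\mathcal{H}}$ and $\bar f^{ls}_\lambda = (T + \lambda)^{-1} T f^{tr}_\lambda$, and $f^{ls}_{\tilde z',\lambda} = (T_{\tilde x} + \lambda)^{-1} T_{\tilde x} f^{tr}_\lambda$ is the RLS estimator constructed by the training set

$$\tilde z' = \left( (\tilde x_1, f^{tr}_\lambda(\tilde x_1)), \dots, (\tilde x_{\tilde m}, f^{tr}_\lambda(\tilde x_{\tilde m})) \right).$$

Hence we get the following decomposition,

(23)  $\left\| f_{\tilde z,\lambda} - f_{\mathcal{H}} \right\|_\rho \le D \left( S^{ls} + R + \bar S^{ls} \right) + P + P^{tr},$

with

$$\begin{aligned}
S^{ls} &= \left\| (T + \lambda)^{\frac12} \left( f^{ls}_{\tilde z,\lambda} - f^{ls}_\lambda \right) \right\|_{\mathcal{H}}, \\
\bar S^{ls} &= \left\| (T + \lambda)^{\frac12} \left( f^{ls}_{\tilde z',\lambda} - \bar f^{ls}_\lambda \right) \right\|_{\mathcal{H}}, \\
D &= \| \Omega_\lambda \|, \\
P &= \left\| \sqrt{T} \left( G_\lambda(T_{\tilde x}) T_{\tilde x} - \mathrm{Id} \right) f^{tr}_\lambda \right\|_{\mathcal{H}}, \\
P^{tr} &= \left\| f^{tr}_\lambda - f_{\mathcal{H}} \right\|_\rho, \\
R &= \left\| (T + \lambda)^{\frac12} \left( f^{ls}_\lambda - \bar f^{ls}_\lambda \right) \right\|_{\mathcal{H}}.
\end{aligned}$$

Terms $S^{ls}$ and $\bar S^{ls}$ will be estimated by Proposition 1, term $D$ by Proposition 7, term $P$ by Proposition 6 and finally terms $P^{tr}$ and $R$ by Proposition 5.

Let us begin with the estimates of $S^{ls}$ and $\bar S^{ls}$. First observe that, by the same reasoning as in the proof of Theorem 1, the assumptions of the Theorem imply inequality (9) in the text of Proposition 1.

Regarding the estimate of $S^{ls}$: applying Proposition 1 and reasoning as in the proof of Theorem 1 (recall that by assumption $\tilde m \ge m \dot\lambda^{-|2-2r-s|_+} \ge m \dot\lambda^{-|1-2r|_+}$ and, from Proposition 3, $\sqrt{\kappa}\, \| f^{ls}_\lambda \|_{\mathcal{H}} \le C_r \dot\lambda^{-|\frac12 - r|_+}$), we get that with probability greater than $1 - \delta$

(24)  $\begin{aligned}
S^{ls} &\le 8 \left( M + \sqrt{\frac{m}{\tilde m}}\, C_r \dot\lambda^{-|\frac12 - r|_+} \right) \left( \frac{2}{m} \sqrt{\frac{\kappa}{\lambda}} + \sqrt{\frac{\mathcal{N}(\lambda)}{m}} \right) \log\frac{6}{\delta} \\
&\le 8 (M + C_r) \frac{1}{\sqrt{m}} \left( \frac{2}{\sqrt{m \dot\lambda}} + \frac{D_s}{\sqrt{\dot\lambda^{s}}} \right) \log\frac{6}{\delta} \\
&= 2 (M + C_r)\, \dot\lambda^r \left( 1 + \frac{\dot\lambda^{r + s - \frac12}}{2 D_s^2 \log\frac{6}{\delta}} \right) && \text{(eq. (7))} \\
&\le 3 (M + C_r)\, \dot\lambda^r && (r + s \ge \tfrac12).
\end{aligned}$

The term $\bar S^{ls}$ can be estimated observing that $\tilde z'$ is a training set of $\tilde m$ supervised samples drawn i.i.d. from the probability measure $\rho'$ with marginal $\rho_X$ and conditional $\rho'_{|x}(y) = \delta(y - f^{tr}_\lambda(x))$. Therefore the regression function induced by $\rho'$ is $f_{\rho'} = f^{tr}_\lambda$, and the support of $\rho'$ is included in $X \times [-M', M']$, with $M' = \sup_{x \in X} |f_{\rho'}(x)| \le \sqrt{\kappa}\, \| f^{tr}_\lambda \|_{\mathcal{H}}$. Again applying Proposition 1 and reasoning as in the proof of Theorem 1, we obtain that with probability greater than $1 - \delta$ it holds

(25)  $\begin{aligned}
\bar S^{ls} &\le 8 \left( M' + \sqrt{\kappa}\, \| \bar f^{ls}_\lambda \|_{\mathcal{H}} \right) \left( \frac{2}{\tilde m} \sqrt{\frac{\kappa}{\lambda}} + \sqrt{\frac{\mathcal{N}(\lambda)}{\tilde m}} \right) \log\frac{6}{\delta} \\
&\le 16 \sqrt{\kappa}\, \| f^{tr}_\lambda \|_{\mathcal{H}} \left( \frac{2}{\tilde m} \sqrt{\frac{\kappa}{\lambda}} + \sqrt{\frac{\mathcal{N}(\lambda)}{\tilde m}} \right) \log\frac{6}{\delta} \\
&\le 16\, C_r \dot\lambda^{-|\frac12 - r|_+} \left( \frac{2}{\tilde m} \sqrt{\frac{\kappa}{\lambda}} + \sqrt{\frac{\mathcal{N}(\lambda)}{\tilde m}} \right) \log\frac{6}{\delta} && \text{(Prop. 5)} \\
&\le 16\, C_r \frac{1}{\sqrt{m}} \left( \frac{2}{\sqrt{m \dot\lambda}} + \frac{D_s}{\sqrt{\dot\lambda^{s}}} \right) \log\frac{6}{\delta} \\
&= 4\, C_r \dot\lambda^r \left( 1 + \frac{\dot\lambda^{r + s - \frac12}}{2 D_s^2 \log\frac{6}{\delta}} \right) && \text{(eq. (7))} \\
&\le 6\, C_r \dot\lambda^r && (r + s \ge \tfrac12).
\end{aligned}$

In order to get an upper bound for $D$ and $P$, we have first to estimate the quantity $\gamma$ (see definition (15)) appearing in Propositions 6 and 7. Our estimate for $\gamma$ follows from Proposition 4 applied to the random variable $\xi : X \to \mathcal{L}_{HS}(\mathcal{H})$ defined by

$$\xi(x)[\cdot] = \lambda^{-1} K_x \langle K_x, \cdot \rangle_{\mathcal{H}}.$$

We can set $H = \frac{2\kappa}{\lambda}$ and $\sigma = \frac{\kappa}{\lambda}$, and obtain that with probability greater than $1 - \delta$

$$\gamma \le \lambda^{-1} \| T - T_{\tilde x} \|_{HS} \le \frac{2}{\dot\lambda} \left( \frac{2}{\tilde m} + \frac{1}{\sqrt{\tilde m}} \right) \log\frac{2}{\delta} \le \frac{4}{\dot\lambda \sqrt{\tilde m}} \log\frac{6}{\delta} \le \dot\lambda^{-1 + |1 - r - \frac{s}{2}|_+}\, \frac{4 \log\frac{6}{\delta}}{\sqrt{m}} \le \dot\lambda^{|1 - r - \frac{s}{2}|_+ - (1 - r - \frac{s}{2})} = \dot\lambda^{|r + \frac{s}{2} - 1|_+} \le 1,$$

where we used the assumption $\tilde m \ge 4 \vee m \dot\lambda^{-|2-2r-s|_+}$ and the expression for $\dot\lambda$ in the text of the Theorem.

Hence, since $\gamma \le 1$, from Proposition 7 we get

(26)  $D \le 3A,$

and from Proposition 6

(27)  $\begin{aligned}
P &\le 2 B_r C_r \left( 3 + r \gamma\, \dot\lambda^{\frac32 - r} \right) \dot\lambda^r \\
&\le 2 B_r C_r \left( 3 + r\, \dot\lambda^{|r + \frac{s}{2} - 1|_+ + \frac32 - r} \right) \dot\lambda^r \\
&\le 2 B_r C_r \left( 3 + r\, \dot\lambda^{\frac{s+1}{2}} \right) \dot\lambda^r \le 2 B_r C_r (3 + r)\, \dot\lambda^r.
\end{aligned}$

Regarding terms $P^{tr}$ and $R$: from Proposition 5 we get

(28)  $P^{tr} \le C_r \dot\lambda^r,$

and hence,

(29)  $R = \left\| (T + \lambda)^{-\frac12}\, T \left( f_{\mathcal{H}} - f^{tr}_\lambda \right) \right\|_{\mathcal{H}} \le \left\| \sqrt{T} \left( f_{\mathcal{H}} - f^{tr}_\lambda \right) \right\|_{\mathcal{H}} = P^{tr} \le C_r \dot\lambda^r.$

The proof is completed by plugging inequalities (24), (25), (26), (27), (28) and (29) in (23) and recalling the expression for $\dot\lambda$. □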